Variable-Span out-of-vocabulary named entity detection
Authors
Abstract
Out-of-vocabulary named entities (OOV NEs) are always misrecognized by fixed-vocabulary automatic speech recognition (ASR) systems. This has a negative impact on downstream applications such as language understanding and machine translation (MT). Automatic detection of OOV NEs in ASR hypotheses can help mitigate this problem by triggering the use of alternative approaches to acquire and process these NEs. State-of-the-art OOV NE detection typically involves tagging ASR-hypothesized words using a sequence model, such as conditional random fields (CRF), in conjunction with a variety of contextual and ASR-derived features. In this paper, we propose a novel variable-span tagging approach for detecting OOV NEs. Instead of tagging individual words in ASR hypotheses, we directly tag longer spans of consecutive words. The proposed approach outperforms a state-of-the-art CRF tagger on two distinct held-out test sets with different OOV NE distributions. On a 5.1K-word test set rich in OOV NEs, our method achieves a 56.1% detection rate at a 10% false alarm rate (vs. 52.1% for the CRF detector). On a 39.4K-word test set with a natural distribution of OOV NEs, we obtain a 73.0% detection rate at a 10% false alarm rate (vs. 69.5% for the CRF detector). In all cases, OOV NEs are completely unobserved in our training data.
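To make the contrast with word-level tagging concrete, the following is a minimal sketch of how candidate spans of consecutive words might be enumerated from an ASR hypothesis before each span is tagged. The abstract does not describe the actual procedure, so the function name `enumerate_spans`, the `max_len` limit, and the example sentence are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Tuple


def enumerate_spans(words: List[str], max_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every span
    of 1 to max_len consecutive words in the hypothesis.

    Hypothetical helper: max_len is an assumed limit, not a value
    taken from the paper.
    """
    spans = []
    for start in range(len(words)):
        # The longest span starting here is capped by max_len and by
        # the end of the hypothesis.
        last_end = min(start + max_len, len(words))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


# Made-up ASR hypothesis in which an OOV name might have been
# misrecognized as a run of in-vocabulary words.
hypothesis = "we met doctor al sharif yesterday".split()
candidates = enumerate_spans(hypothesis, max_len=3)

for start, end in candidates:
    # A span-level tagger would score each candidate here; this sketch
    # only lists the candidates.
    print(start, end, " ".join(hypothesis[start:end]))
```

A word-level tagger would consider only the single-word spans, whereas a span-level detector can treat a multi-word region such as "al sharif" as one unit.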
Similar resources
THE JOHNS HOPKINS UNIVERSITY Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. We present a novel probabilistic model to l...
A combined Approach to Arabic Named Entity recognition Using SVM and Pattern Extracted method applied to Topic Detection
Named Entity Recognition (NER) is a clue task for automatic text processing that is required in a wide variety of applications. NER techniques range from handcrafted rules to machine learning approaches. In this paper, we describe the development and implementation of an Arabic Named Entity Recognition (ANER) System, based on machine learning approach. We used SVM classifier with a set of depen...
A spoken term detection framework for recovering out-of-vocabulary words using the web
Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We pro...
OOV Sensitive Named-Entity Recognition in Speech
Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named e...
Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to...